Distant speech separation using predicted time-frequency masks from spatial features
نویسندگان
چکیده
Speech separation algorithms are faced with a difficult task of producing high degree of separation without containing unwanted artifacts. The time-frequency (T-F) masking technique applies a real-valued (or binary) mask on top of the signal’s spectrum to filter out unwanted components. The practical difficulty lies in the mask estimation. Often, using efficient masks engineered for separation performance leads to presence of unwanted musical noise artifacts in the separated signal. This lowers the perceptual quality and intelligibility of the output. Microphone arrays have been long studied for processing of distant speech. This work uses a feed-forward neural network for mapping microphone array’s spatial features into a T-F mask. Wiener filter is used as a desired mask for training the neural network using speech examples in simulated setting. The T-F masks predicted by the neural network are combined to obtain an enhanced separation mask that exploits the information regarding interference between all sources. The final mask is applied to the delay-and-sum beamformer (DSB) output. The algorithm’s objective separation capability in conjunction with the separated speech intelligibility is tested with recorded speech from distant talkers in two rooms from two distances. The results show improvement in instrumental measure for intelligibility and frequency-weighted SNR over complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and minimum variance distortionless response (MVDR).
منابع مشابه
Speaker separation using visually-derived binary masks
This paper is concerned with the problem of single-channel speaker separation and exploits visual speech information to aid the separation process. Audio from a mixture of speakers is received from a single microphone and to supplement this, video from each speaker in the mixture is also captured. The visual features are used to create a time-frequency binary mask that identifies regions where ...
متن کاملMulti-channel speech separation with soft time-frequency masking
This paper addresses the problem of separating concurrent speech through a spatial filtering stage and a subsequent time-frequency masking stage. These stages complement each other by first exploiting the spatial diversity and then making use of the fact that different speech signals rarely occupy the same frequency bins at a time. The novelty of the paper consists in the use of auditorymotivat...
متن کاملCombined Multi-Channel NMF-Based Robust Beamforming for Noisy Speech Recognition
We propose a novel acoustic beamforming method using blind source separation (BSS) techniques based on non-negative matrix factorization (NMF). In conventional mask-based approaches, hard or soft masks are estimated and beamforming is performed using speech and noise spatial covariance matrices calculated from masked noisy observations, but the phase information of the target speech is not adeq...
متن کاملOn the Window-disjoint-orthogonality of Speech Sources in Reverberant Humanoid Scenarios
Many speech source separation approaches are based on the assumption of orthogonality of speech sources in the time-frequency domain. The target speech source is demixed from the mixture by applying the ideal binary mask to the mixture. The time-frequency orthogonality of speech sources is investigated in detail only for anechoic and artificially mixed speech mixtures. This paper evaluates how ...
متن کاملAdaptive beamforming and soft missing data decoding for robust speech recognition in reverberant environments
This paper presents a novel approach to combine microphone array processing and robust speech recognition for reverberant multi-speaker environments. Spatial cues are extracted from a microphone array and automatically clustered to estimate localization masks in the time-frequency domain. The localization masks are then used to blindly design adaptive filters in order to enhance the source sign...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Speech Communication
دوره 68 شماره
صفحات -
تاریخ انتشار 2015